\documentclass{article}

\usepackage[left=2cm, top=1cm, right=2cm, nohead, nofoot]{geometry}

\author{Eran Chinthaka, Michael Conover, Michel Salim}

\title{Blog Mining: Project Progress Report}

\begin{document}
\bibliographystyle{plain}

\maketitle

\section{Implementation Details}
The project contains three basic modules, namely; crawler to gather links
structure and the content of blogs, persistence layer, implemented using a
database, to store all the collected data and a framework which runs on the
collected data to do predictions based on links and content-based similarity.

\subsection{Crawler}
We extended the java crawler implementation \cite{JavaCrawler} to crawl only the
blogs and extracted blog links information, together with the content of them. 
Content might later be used to do a content-based similarity checking to verify
the predictions. 

We evaluated multiple criterias to be used during the blog detection
process and all these criterias were derived from the patterns
identified using currently active blogs. Some of the criterias include
prediction based on the URLs, the presence of RSS feeds and the
content of those pages.

\subsection{Persistence Layer}
We implemented a persistance layer to store and retrieve collected data on top
of a database. Currently we rely on MySQL but we have no hard dependency on that
database. 

\subsection{Link-based similarity}
We evaluated various algorithms which can be used in the area of blog
detection and currently have two algorithms implemented on top of the
JUNG \cite{Jung} framework.

\begin{itemize}
\item A friend-of-a-friend (FOAF) algorithm that operates on a
  subgraph consisting of the starting node, nodes connected by its
  outlinks (D=1) and the nodes connected from those nodes (D=2). We
  are considering extending this to also include the parents of the
  D=1 set, or even all the grandparents of D=2, on the intuition that
  nodes with high correferentiality are more likely to be similar

\item the HITS algorithm provided by the Jung framework. This
  currently does not provide very interesting results; most likely
  because of weaknesses in the node selection process for the initial
  subgraph (see above)
\end{itemize}

We use Network Workbench to visually represent the topology of our
subgraphs and are finding sites strongly correlated with the starting
node, which remain to be reliably extracted.

\subsection{Content-based similarity}
Our intention is to use content-based measures to approximately
measure the performance level of our link-based measures. We are
currently looking at the Word Vector Tool (WVTool) and Apache Lucene.

\bibliography{web-mining}

\end{document}
